feat: add Application Insights for OTEL tracing and uat->staging rename (#68)
JustAGhosT merged 11 commits into dev.
Conversation
Renamed all infrastructure files from UAT to staging and updated all environment references in configuration files and validation rules to use "staging" instead of "uat".
Replaces all instances of "uat" with "staging" in deployment workflow and documentation, including GitHub Actions job names, environment references, and PR label triggers.
…ging rename
- Add Application Insights resource for trace storage
- Add appinsights connection string to Key Vault and Container App env
- Add OTEL configuration for azure-monitor-opentelemetry exporter
- Complete uat->staging rename across all docs and scripts
- Update planning doc with completed Phase 1 & 2 status
📝 Walkthrough
This PR renames the UAT environment to staging across CI/CD, docs, scripts, and Terraform; adds Application Insights telemetry (resource, Key Vault secret, container app env var, and sensitive output) to the aigateway Terraform module; and adds many new architecture/SLM documentation files.
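The telemetry wiring the walkthrough describes can be sketched in Terraform roughly as follows; the resource names, variables, and secret name are illustrative assumptions, not the aigateway module's actual code.

```hcl
# Sketch only - names and variables are assumptions, not the module's code.
resource "azurerm_application_insights" "gateway" {
  name                = "appi-aigateway-${var.env}"
  location            = var.location
  resource_group_name = var.resource_group_name
  application_type    = "web"
}

# Store the connection string in Key Vault...
resource "azurerm_key_vault_secret" "appinsights" {
  name         = "appinsights-connection-string"
  value        = azurerm_application_insights.gateway.connection_string
  key_vault_id = var.key_vault_id
}

# ...and surface it to the Container App via a secret-backed env var
# (fragment of a container block inside azurerm_container_app):
#   env {
#     name        = "APPLICATIONINSIGHTS_CONNECTION_STRING"
#     secret_name = "appinsights-connection-string"
#   }
```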
Sequence Diagram

```mermaid
sequenceDiagram
    participant CI as CI (workflow)
    participant ARM as Azure Resource Manager
    participant KV as Key Vault
    participant CA as Container App
    participant App as Application
    rect rgba(200,200,255,0.5)
        CI->>ARM: create azurerm_application_insights
        ARM-->>CI: return connection_string
        CI->>KV: create secret(appinsights_connection_string)
        KV-->>CI: confirm secret
        CI->>ARM: create/update Container App referencing Key Vault secret
        ARM-->>CA: deploy Container App with secret mapping
    end
    rect rgba(200,255,200,0.5)
        CA->>App: container starts with APPLICATIONINSIGHTS_CONNECTION_STRING env
        App->>ARM: send telemetry (OTEL → Application Insights)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
README.md (1)
107-107: ⚠️ Potential issue | 🟡 Minor
Inconsistent reference to "UAT" in documentation link.
Line 107 still references "UAT toggle" while the rest of the PR renames UAT to staging. This appears to be an oversight in the documentation description.
```diff
-- [CI/CD Runbook](docs/CI_CD.md) – workflow behavior, UAT toggle, smoke tests
+- [CI/CD Runbook](docs/CI_CD.md) – workflow behavior, staging toggle, smoke tests
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` at line 107, Update the README line that reads "[CI/CD Runbook](docs/CI_CD.md) – workflow behavior, UAT toggle, smoke tests" to use the new terminology "staging" instead of "UAT" (e.g., "...workflow behavior, staging toggle, smoke tests") so it matches the rest of the PR; ensure the phrase appears exactly in the README and the linked docs if there are other occurrences of "UAT" to rename consistently.
docs/PRD.md (2)
170-170: ⚠️ Potential issue | 🟡 Minor
Incomplete uat→staging rename in acceptance criteria.
Line 170 still references "Dev/UAT/Prod" while the rest of the document has been updated to use "staging". This creates an inconsistency.
📝 Proposed fix
```diff
-3. Dev/UAT/Prod are reproducible via Terraform + Actions.
+3. Dev/Staging/Prod are reproducible via Terraform + Actions.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/PRD.md` at line 170, Replace the remaining "Dev/UAT/Prod" phrasing in the acceptance criteria with the consistent "Dev/staging/Prod" term (specifically update the item containing "Dev/UAT/Prod" to "Dev/staging/Prod"); also scan the PRD.md for any other occurrences of "UAT" or "UAT→staging" and update them to "staging" so the document uses a single environment name consistently.
183-183: ⚠️ Potential issue | 🟡 Minor
Incomplete uat→staging rename in milestones.
Line 183 still references "UAT" in the milestone description while the rest of the document uses "staging".
📝 Proposed fix
```diff
-- M2: UAT + Prod; environment approvals.
+- M2: Staging + Prod; environment approvals.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/PRD.md` at line 183, The milestone text "M2: UAT + Prod; environment approvals." is inconsistent with the rest of the doc that uses "staging"—update the M2 milestone (look for the "M2" milestone heading/text) to read "M2: staging + Prod; environment approvals." and scan for any other lingering "UAT" references to replace them with "staging" to maintain consistency.
🧹 Nitpick comments (1)
docs/planning/request_to_token_attribution.md (1)
219-221: Minor: Consider tense consistency in dependencies.
Line 221 states "Application Insights added for trace storage" using past tense, but this is the PR adding Application Insights according to the objectives. While this might be intentional (documenting the post-merge state), consider using present/future tense for clarity during review: "Application Insights being added for trace storage" or "Requires Application Insights for trace storage".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/planning/request_to_token_attribution.md` around lines 219 - 221, Change the tense for the infra dependency line "infra: Application Insights added for trace storage" to present/future tense to reflect that this PR is adding it; for example update the list item to "infra: Application Insights being added for trace storage" or "infra: Requires Application Insights for trace storage" so the dependency wording is consistent with the other lines like "cognitive-mesh: Must pass correlation metadata to gateway" and "pvc-costops-analytics: Must create KQL queries for new event shape".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy.yaml:
- Line 38: The comment on the deploy workflow is inconsistent: update the
comment that currently reads "Runtime UAT toggle: add PR label 'run-staging' to
enable UAT on PRs into main." so it references "run-staging" terminology; change
the phrase "Runtime UAT toggle" to language that matches the label (e.g.,
"Runtime staging toggle") so the comment aligns with the `run-staging` label and
the purpose described in the workflow.
In `@docs/planning/request_to_token_attribution.md`:
- Around line 210-215: The acceptance table incorrectly marks "100% include
workflow + stage" and "Support KQL joins by operation_Id/request_id" as ✅ Done
while downstream tasks show dependencies still outstanding; update the table
rows for "100% include workflow + stage" and "Support KQL joins by
operation_Id/request_id" to reflect the real state by either (A) changing the
Status cell to "🔜 Ready" or "⏳ In Progress" until cognitive-mesh begins passing
correlation metadata and pvc-costops-analytics finishes KQL queries, or (B)
rewording the criterion label to "Capability Available: workflow+stage metadata"
/ "Capability Available: KQL join support" to distinguish capability from
feature completion; ensure you reference the dependency names cognitive-mesh and
pvc-costops-analytics in the Notes column so the relationship is clear.
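The KQL joins referenced above might look like the following; the table and column names assume the default Application Insights schema (`requests`, `dependencies`, `operation_Id`) and are illustrative only, not queries from pvc-costops-analytics.

```kql
// Illustrative only - assumes the default Application Insights tables.
// Joining on operation_Id correlates a gateway request with its downstream calls.
requests
| join kind=inner (dependencies) on operation_Id
| project timestamp, operation_Id, request = name, dependency = name1, duration
```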
---
Outside diff comments:
In `@docs/PRD.md`:
- Line 170: Replace the remaining "Dev/UAT/Prod" phrasing in the acceptance
criteria with the consistent "Dev/staging/Prod" term (specifically update the
item containing "Dev/UAT/Prod" to "Dev/staging/Prod"); also scan the PRD.md for
any other occurrences of "UAT" or "UAT→staging" and update them to "staging" so
the document uses a single environment name consistently.
- Line 183: The milestone text "M2: UAT + Prod; environment approvals." is
inconsistent with the rest of the doc that uses "staging"—update the M2
milestone (look for the "M2" milestone heading/text) to read "M2: staging +
Prod; environment approvals." and scan for any other lingering "UAT" references
to replace them with "staging" to maintain consistency.
In `@README.md`:
- Line 107: Update the README line that reads "[CI/CD Runbook](docs/CI_CD.md) –
workflow behavior, UAT toggle, smoke tests" to use the new terminology "staging"
instead of "UAT" (e.g., "...workflow behavior, staging toggle, smoke tests") so
it matches the rest of the PR; ensure the phrase appears exactly in the README
and the linked docs if there are other occurrences of "UAT" to rename
consistently.
---
Nitpick comments:
In `@docs/planning/request_to_token_attribution.md`:
- Around line 219-221: Change the tense for the infra dependency line "infra:
Application Insights added for trace storage" to present/future tense to reflect
that this PR is adding it; for example update the list item to "infra:
Application Insights being added for trace storage" or "infra: Requires
Application Insights for trace storage" so the dependency wording is consistent
with the other lines like "cognitive-mesh: Must pass correlation metadata to
gateway" and "pvc-costops-analytics: Must create KQL queries for new event
shape".
ℹ️ Review info: Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID: bb682266-666a-4fa8-8634-47a25a4eacc6
📒 Files selected for processing (27)
- .github/actions/import-container-app/action.yml
- .github/pull_request_template.md
- .github/workflows/deploy.yaml
- README.md
- dashboard/app.js
- docs/AZURE_OIDC_SETUP.md
- docs/CI_CD.md
- docs/PRD.md
- docs/SECRETS.md
- docs/Terraform_Blueprint.md
- docs/planning/request_to_token_attribution.md
- infra/env/dev/variables.tf
- infra/env/prod/variables.tf
- infra/env/staging/.terraform.lock.hcl
- infra/env/staging/main.tf
- infra/env/staging/terraform.tfvars
- infra/env/staging/variables.tf
- infra/modules/aigateway_aca/main.tf
- infra/modules/aigateway_aca/outputs.tf
- infra/modules/aigateway_aca/variables.tf
- infra/modules/dashboard_aca/variables.tf
- infra/modules/state_service_aca/variables.tf
- infra/scripts/terraform-init.ps1
- infra/scripts/terraform-init.sh
- scripts/add-federated-credentials.sh
- scripts/bootstrap.ps1
- scripts/bootstrap.sh
- deploy.yaml: fix 'Runtime UAT toggle' comment to 'Runtime staging toggle'
- request_to_token_attribution.md: update acceptance criteria status to 'Ready' with dependency notes, fix tense on infra dependency
- PRD.md: replace Dev/UAT/Prod with Dev/staging/Prod, fix M2 milestone
- README.md: fix 'UAT toggle' to 'staging toggle'
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/deploy.yaml (1)
327-338: Consider extracting deploy job boilerplate into a reusable workflow/composite action.
Adding `deploy-staging` increases duplication with `deploy-dev`/`deploy-prod` and raises drift risk for future edits (quickcheck, login, init/plan/apply, smoke tests).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy.yaml around lines 327 - 338, The deploy-staging job duplicates logic present in deploy-dev/deploy-prod; extract the shared steps (quickcheck, login, init/plan/apply, smoke tests) into a reusable workflow or composite action and call it from the deploy-staging, deploy-dev, and deploy-prod jobs; create a new reusable workflow/composite (e.g., deploy-workflow or deploy-composite) that accepts inputs like TF_VAR_env and PR label conditions, move the common run steps there, and update the deploy-staging job (and the other deploy-* jobs) to invoke that reusable workflow via uses: or workflow_call so only environment-specific conditionals (name, environment, env.TF_VAR_env, if condition) remain in each job.
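The refactor suggested above could look like the following reusable workflow; the file name, inputs, and steps are illustrative assumptions, not the repository's actual jobs.

```yaml
# .github/workflows/deploy-env.yaml (hypothetical name) - shared deploy logic
name: deploy-env
on:
  workflow_call:
    inputs:
      env:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.env }}
    env:
      TF_VAR_env: ${{ inputs.env }}
    steps:
      - uses: actions/checkout@v4
      # Shared steps (login, init/plan/apply, smoke tests) live here once,
      # so dev/staging/prod cannot drift apart.
      - run: terraform -chdir=infra/env/${{ inputs.env }} init
      - run: terraform -chdir=infra/env/${{ inputs.env }} apply -auto-approve

# Each environment job then reduces to a thin caller, e.g.:
#   deploy-staging:
#     uses: ./.github/workflows/deploy-env.yaml
#     with:
#       env: staging
```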
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/planning/request_to_token_attribution.md`:
- Around line 233-235: The doc uses two conflicting phrases — "pass correlation
metadata in request body" and "must pass correlation headers" — causing
ambiguity; choose one canonical requirement (preferably "pass correlation
metadata in request body") and update every occurrence of the other phrase so
the integration requirement for cognitive-mesh is consistent, including the
pending item line that currently lists "Pass correlation metadata in request
body" and the earlier sentence that currently reads "must pass correlation
headers", plus any related sections or examples that describe the required
integration mode (search for those exact strings to find and replace).
- Line 221: Replace the claim "Application Insights being added for trace
storage" with the clarified sentence: "Application Insights resource +
connection string wiring added; actual trace export requires custom LiteLLM
image deployment or explicit OTLP collector configuration." Also ensure the note
references the existing symbols/vars: APPLICATIONINSIGHTS_CONNECTION_STRING, the
comment about the custom LiteLLM image including azure-monitor-opentelemetry,
and the default empty OTEL_EXPORTER_OTLP_ENDPOINT so readers see that trace
export is not yet operational without those pieces.
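As a small illustration of the env-var wiring discussed above, here is a hypothetical helper that splits an `APPLICATIONINSIGHTS_CONNECTION_STRING` value into its parts; the function name and sample values are assumptions for illustration, not project code.

```python
# Hypothetical helper, not project code: split an Application Insights
# connection string ("Key1=val1;Key2=val2;...") into a dict so an app can
# sanity-check the env var before enabling a telemetry exporter.
def parse_connection_string(value: str) -> dict:
    parts = {}
    for segment in value.split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        key, _, val = segment.partition("=")
        parts[key] = val
    return parts

sample = (
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;"
    "IngestionEndpoint=https://example.in.applicationinsights.azure.com/"
)
parsed = parse_connection_string(sample)
print(sorted(parsed))  # keys present in the sample string
```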
---
Nitpick comments:
In @.github/workflows/deploy.yaml:
- Around line 327-338: The deploy-staging job duplicates logic present in
deploy-dev/deploy-prod; extract the shared steps (quickcheck, login,
init/plan/apply, smoke tests) into a reusable workflow or composite action and
call it from the deploy-staging, deploy-dev, and deploy-prod jobs; create a new
reusable workflow/composite (e.g., deploy-workflow or deploy-composite) that
accepts inputs like TF_VAR_env and PR label conditions, move the common run
steps there, and update the deploy-staging job (and the other deploy-* jobs) to
invoke that reusable workflow via uses: or workflow_call so only
environment-specific conditionals (name, environment, env.TF_VAR_env, if
condition) remain in each job.
ℹ️ Review info: Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID: 615a4d0d-704f-4f4e-a5d2-153645401c33
📒 Files selected for processing (4)
- .github/workflows/deploy.yaml
- README.md
- docs/PRD.md
- docs/planning/request_to_token_attribution.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/PRD.md
- docs/architecture/README.md: SLM characteristics, patterns, advantages, limitations
- docs/guides/README.md: Practical guidance on when to use SLM vs LLM, implementation patterns
🧹 Nitpick comments (2)
docs/guides/README.md (1)
68-68: Consider minor rewording for style.
The phrase "In many cases" is commonly used. Consider alternatives like "Often" or "Frequently" for variety.
✏️ Optional style improvement
```diff
-In many cases an SLM + RAG outperforms a large model with no context.
+Often an SLM + RAG outperforms a large model with no context.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/guides/README.md` at line 68, Replace the sentence "In many cases an SLM + RAG outperforms a large model with no context." with a tighter phrasing using an alternative to "In many cases" (e.g., "Often, an SLM + RAG outperforms a large model with no context." or "Frequently, an SLM + RAG outperforms a large model with no context."), preserving the "SLM + RAG" wording and punctuation; ensure spacing and comma after the adverb are correct and that the rest of the sentence remains unchanged.
docs/architecture/README.md (1)
30-75: Add language specifiers to fenced code blocks for linting compliance.
The ASCII diagrams in fenced code blocks should specify a language (or `text`) to satisfy markdown linting rules.
🔧 Proposed fix for linting compliance
For the Cascade Architecture diagram (lines 30-38):
````diff
-```
+```text
 SLM
 ↓ confidence high
 Return result
````

Apply the same pattern to other ASCII diagrams at:
- Lines 44-50 (Router + Specialists)
- Lines 56-61 (Local-First AI)
- Lines 67-75 (Typical Modern AI Stack)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/README.md` around lines 30 - 75, The fenced ASCII diagrams (the blocks containing "SLM" / "↓ confidence high" / "Return result", the "Router (SLM)" block with "Code model" etc., the "Device" block with "SLM" and "embeddings", and the "User Request" -> "Router (SLM)" diagram) need explicit language specifiers for markdown linting; update each triple-backtick fence that currently starts with ``` to use ```text (e.g., change ``` to ```text) for the Cascade Architecture ("SLM"/"↓ confidence high"), Router + Specialists ("Router (SLM)"), Local-First AI ("Device"), and Typical Modern AI Stack ("User Request"/"Router (SLM)") blocks so linters accept the files.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/architecture/README.md`:
- Around line 30-75: The fenced ASCII diagrams (the blocks containing "SLM" / "↓
confidence high" / "Return result", the "Router (SLM)" block with "Code model"
etc., the "Device" block with "SLM" and "embeddings", and the "User Request" ->
"Router (SLM)" diagram) need explicit language specifiers for markdown linting;
update each triple-backtick fence that currently starts with ``` to use ```text
(e.g., change ``` to ```text) for the Cascade Architecture ("SLM"/"↓ confidence
high"), Router + Specialists ("Router (SLM)"), Local-First AI ("Device"), and
Typical Modern AI Stack ("User Request"/"Router (SLM)") blocks so linters accept
the files.
In `@docs/guides/README.md`:
- Line 68: Replace the sentence "In many cases an SLM + RAG outperforms a large
model with no context." with a tighter phrasing using an alternative to "In many
cases" (e.g., "Often, an SLM + RAG outperforms a large model with no context."
or "Frequently, an SLM + RAG outperforms a large model with no context."),
preserving the "SLM + RAG" wording and punctuation; ensure spacing and comma
after the adverb are correct and that the rest of the sentence remains
unchanged.
ℹ️ Review info: Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID: b54ca6a9-a1cb-453c-9cec-f60b1de83b9d
📒 Files selected for processing (2)
- docs/architecture/README.md
- docs/guides/README.md
…nt plan
- docs/architecture/: Add README, ai-gateway, cognitive-mesh, phoenix-rooivalk, codeflow-engine, agentkit-forge, cross-system, and slm-management-plan
- docs/guides/: Update index
- Remove empty infra/env/uat directory
- ai-gateway.md: SLM as admission control and routing brain
- cognitive-mesh.md: SLM as router, decompressor, state compressor
- codeflow-engine.md: SLM as CI/CD triage
- agentkit-forge.md: SLM as tool selector and execution governor
- phoenix-rooivalk.md: SLM for reports only, NOT safety-critical
- cross-system.md: Combined architecture with tiered model strategy
- slm-management-plan.md: Updated with canonical principles
- guides/README.md: Updated with practical patterns
…itecture documentation. This PR makes significant improvements to the architecture documentation by:
1. Creating a better organizational structure with separate folders for:
   - `systems/` - Individual system documentation
   - `reference/` - Detailed implementation matrices and technical specifications
   - `reference/strategic/` - Strategic guidance documents
2. Adding comprehensive documentation about SLM (Small Language Model) implementation across all systems:
   - Detailed implementation matrices for each system
   - Confidence threshold guidelines
   - Telemetry schemas
   - Contract shapes and API endpoints
3. Adding new architectural diagrams:
   - Deployment architecture
   - Trust boundaries
   - Observability architecture
   - Cross-system integration patterns
4. Adding documentation for a new system called "Mystira" (interactive story generation system)
5. Providing strategic guidance on SLM implementation order and best practices

The PR significantly improves the documentation by providing practical implementation details while maintaining a consistent architectural vision across all systems. The reorganization makes the documentation more navigable and the new reference materials provide clear implementation guidance. This PR should be approved as it greatly enhances the architecture documentation with practical implementation details.
Actionable comments posted: 12
🧹 Nitpick comments (9)
docs/architecture/systems/cognitive-mesh.md (1)
146-146: Consider tightening wording in the tradeoff table.
Line 146 uses “Very large token savings”; “Large token savings” reads cleaner and is more precise for architecture docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/systems/cognitive-mesh.md` at line 146, Update the tradeoff table entry that currently reads "Very large token savings" to "Large token savings" to tighten wording; locate the table row containing the phrase "Very large token savings | Decomposition quality can bottleneck workflow" in cognitive-mesh.md and replace the left-side cell text only, leaving the rest of the row unchanged.
docs/architecture/systems/ai-gateway.md (2)
7-24: Add language specifier to fenced code block.
The ASCII architecture diagram should have a language specifier. Consider using `text` or `plaintext` to satisfy markdown linting rules.
📝 Proposed fix
````diff
-```
+```text
 Client Request
 │
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/systems/ai-gateway.md` around lines 7 - 24, The fenced ASCII diagram block in ai-gateway.md is missing a language specifier; update the opening triple-backtick that precedes the diagram (the block containing "Client Request" and "SLM Control Layer") to include a language tag such as text or plaintext (e.g., change ``` to ```text) so the diagram satisfies markdown linting rules and is treated as plain text.
1-146: Consider documenting v1 API endpoint routing.
Based on learnings, the gateway must route `/v1/responses` and `/v1/embeddings` endpoints to Azure OpenAI using LiteLLM. The current documentation focuses on SLM admission control but doesn't mention these specific API endpoints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/systems/ai-gateway.md` around lines 1 - 146, The docs miss explicit v1 API routing rules; update the architecture doc to state that incoming /v1/responses and /v1/embeddings requests must be routed to Azure OpenAI via the LiteLLM provider. Edit the "Routing Logic" section (update gateway_admission and/or add a short route_to_provider example) to show mapping of those endpoints to Azure OpenAI/LiteLLM, add a note in the "Implementation Checklist" to configure LiteLLM Azure provider for those endpoints, and include a short example/rule describing any differences in handling responses vs embeddings (e.g., batching, model selection).
docs/architecture/systems/phoenix-rooivalk.md (1)
7-27: Add language specifier to fenced code block.
The ASCII architecture diagram should have a language specifier such as `text` to satisfy markdown linting rules.
📝 Proposed fix
````diff
-```
+```text
 Sensors
 │
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/systems/phoenix-rooivalk.md` around lines 7 - 27, The fenced ASCII diagram currently lacks a language specifier; update the Markdown fenced code block that surrounds the ASCII architecture diagram (the block starting with ``` and containing "Sensors", "Rules + Signal Models + Fusion", "SLM Interpretation Layer", etc.) to include a language tag (e.g., add `text` after the opening backticks as ```text) so markdown linting rules are satisfied.
docs/architecture/reference/slm-implementation-matrix.md (2)
18-27: Add language specifier to fenced code block.
The directory structure should have a language specifier. Consider using `text` or `plaintext`.
📝 Proposed fix
````diff
-```
+```text
 reference/
 ├── slm-implementation-matrix.md   # This file
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/reference/slm-implementation-matrix.md` around lines 18 - 27, The fenced code block in slm-implementation-matrix.md lacks a language specifier which tools rely on for proper rendering; update the opening fence for the directory tree in the file (the block starting with ```) to include a language label such as text or plaintext so it reads e.g. ```text, ensuring the directory structure snippet renders correctly.
254-260: Add language specifier to fenced code block.
The fallback pattern list should have a language specifier. Consider using `text` or `plaintext`.
📝 Proposed fix
````diff
-```
+```text
 1. SLM timeout → Deterministic rules
 2. Low confidence → LLM escalation
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/reference/slm-implementation-matrix.md` around lines 254 - 260, The fenced code block containing the fallback pattern list should include a language specifier so syntax highlighters render it as plain text; update the triple-backtick fence around the list (the block starting with "1. SLM timeout → Deterministic rules") to use a language like "text" or "plaintext" (e.g., ```text) and keep the content unchanged.
docs/architecture/04-observability-telemetry.md (2)
37-37: Clarify Application Insights vs Azure Monitor terminology.
The documentation uses "Azure Monitor" generically, but the actual infrastructure provisions an Azure Application Insights resource and exposes `APPLICATIONINSIGHTS_CONNECTION_STRING` as the environment variable. Based on learnings from infra/modules/aigateway_aca/main.tf:336-345, the OTEL exporter specifically targets Application Insights.
Consider being more specific: "Application Insights" instead of "Azure Monitor" for clarity, or add a note explaining that Application Insights is the Azure Monitor implementation used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/04-observability-telemetry.md` at line 37, The doc uses "Azure Monitor" generically (see node label I2[Azure Monitor]); update wording to specifically say "Application Insights" where the deployed resource and OTEL exporter target Application Insights, or add a clarifying note that Application Insights is the Azure Monitor implementation used and that the OTEL exporter uses the APPLICATIONINSIGHTS_CONNECTION_STRING env var; ensure references to the OTEL exporter and the environment variable (APPLICATIONINSIGHTS_CONNECTION_STRING) and the infrastructure intent (OTEL -> Application Insights) are explicit.
1-94: Consider documenting retention policies.
The Application Insights resource has environment-specific retention (90 days for prod, 30 days for others) as shown in infra/modules/aigateway_aca/main.tf:239-247. This is an operational consideration that could benefit observability operators.
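The retention behavior this comment references could be expressed with a conditional like the sketch below; the resource name is an assumption, though the 90/30-day values come from the comment itself.

```hcl
# Sketch: environment-specific retention (values per the review comment).
resource "azurerm_application_insights" "gateway" {
  # ... other arguments ...
  retention_in_days = var.env == "prod" ? 90 : 30
}
```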
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/04-observability-telemetry.md` around lines 1 - 94, Add a short "Retention policies" subsection under "Telemetry Architecture" documenting the Application Insights retention defaults used by the infra: production retention 90 days and non-production retention 30 days, note that these are environment-specific and where operators should look to change them (the Application Insights resource configuration), and recommend including this retention info in operational runbooks so observability operators can align expectations and costs; reference the "Application Insights" resource and the "Telemetry Architecture" section to locate where to add the note.
docs/architecture/reference/strategic/07-deployment-model.md (1)
29-34: Consider aligning CodeFlow matrix wording with sibling reference docs.
“PR triage, log analysis” is accurate but less specific than the established terms used elsewhere (e.g., PR classification, CI failure triage, release-note extraction). Tightening terminology here will reduce cross-doc ambiguity.
Proposed wording tweak
```diff
-| CodeFlow | PR triage, log analysis | Root cause across dependencies |
+| CodeFlow | PR classification, CI failure triage | Root cause across dependencies |
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/architecture/reference/strategic/07-deployment-model.md` around lines 29 - 34, Update the CodeFlow "Best SLM Jobs" cell to use the more specific terminology used across sibling docs: replace the phrase "PR triage, log analysis" in the CodeFlow table row with targeted terms such as "PR classification, CI failure triage, release-note extraction" so the matrix aligns with other references and reduces ambiguity when searching for CodeFlow capabilities.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/architecture/02-container-architecture.md`:
- Around line 19-20: The diagram declares the node C3[GitHub Webhooks] but never
connects it into the flow; add an edge from C3 to the appropriate downstream
node (e.g., CodeFlow or whichever node represents ingress/CI processing) so the
webhook is part of the flow. Update the same missing connections in the later
block (the repeated section around lines 58-83) so both occurrences of C3 are
connected (for example: C3 --> CodeFlow or C3 --> <IngressNodeName>), ensuring
the diagram reflects actual webhook ingress behavior.
In `@docs/architecture/04-observability-telemetry.md`:
- Around line 35-39: The Ingest subgraph is missing Prometheus as a telemetry
destination; update the diagram block (the Ingest subgraph containing
I1/OpenTelemetry, I2/Azure Monitor, I3/Blob Export) to add a fourth node for
Prometheus (e.g., I4[Prometheus]) and add the flow from the application metrics
endpoint (/metrics) into that node; also update the surrounding text to mention
that LiteLLM enables Prometheus callbacks (success_callback/failure_callback
include "prometheus") and reference the
infra/modules/aigateway_aca/main.tf:95-113 behavior so readers know Prometheus
is a primary telemetry sink alongside OTEL, Azure Monitor and Blob Export.
In `@docs/architecture/reference/matrix-gateway.md`:
- Around line 91-95: Replace the inconsistent confidence threshold text so the
doc uses the same policy values from the gateway SLM guide (0.90 / 0.75): update
the table row currently reading "confidence < 0.70" to the correct boundary
(e.g., "confidence < 0.75"), and ensure all other mentions of confidence-based
routing (the lines referencing escalation to LLM or fallback behavior) use the
0.90 / 0.75 thresholds consistently.
- Around line 42-52: The example response JSON does not match the declared
contract type ClassifyRequestOutput; update the example payload to use the
contract's field names (label and recommended_tier) and remove or rename
intent/recommended_target/recommended_model_tier accordingly so the example
reflects ClassifyRequestOutput's shape; ensure any referenced values (e.g.,
target engine or tier) are mapped into recommended_tier or documented separately
and apply the same fix to the other example block at lines 57-74.
In `@docs/architecture/reference/matrix-rooivalk.md`:
- Around line 110-113: The DEFAULT_THRESHOLDS constant contains
sop_suggestion.direct_suggest set to 0.78 which conflicts with the documented
threshold table stating top-tier behavior is >= 0.80; update the
DEFAULT_THRESHOLDS (symbol DEFAULT_THRESHOLDS) entry for
sop_suggestion.direct_suggest to 0.80 and verify any other related keys
(operator_summary, sop_suggestion.manual_lookup) and the summary table values
are consistent so the code constant and documentation table match.
- Around line 27-44: The fenced safety ASCII-art block in matrix-rooivalk.md is
unlabeled, triggering MD040; update the block fence to include a language hint
(e.g., add "text" after the opening ```) so the safety boundary block becomes a
labeled fenced code block and stops triggering the lint rule while preserving
the ASCII-art content.
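Reviewer note: since this same MD040 fix recurs across several files below, a one-pass scan can locate every bare opening fence before editing. This is an illustrative sketch only (not repo code); it assumes well-formed markdown where fence lines strictly alternate between opening and closing.

```python
import re

def bare_opening_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening ``` fences that lack a language label."""
    bare = []
    inside = False  # toggles at each fence line: open -> close -> open ...
    for i, line in enumerate(markdown.splitlines(), start=1):
        if re.match(r"^\s*```", line):
            # Only an *opening* fence with nothing after the backticks is a violation.
            if not inside and re.match(r"^\s*```\s*$", line):
                bare.append(i)
            inside = not inside
    return bare

sample = "# Doc\n\n```\nascii art\n```\n\n```text\nlabeled\n```\n"
print(bare_opening_fences(sample))  # [3] — only the unlabeled fence is reported
```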
In `@docs/architecture/reference/slm-management-plan.md`:
- Around line 43-56: The Markdown file contains unlabeled fenced code blocks
(e.g., the "Cost Control Layers" ASCII-art block, the "Request" ASCII-art block,
and the "Discovery → Testing → Staging → Production → Deprecated → Retired"
block) which trigger MD040; update each fenced block by adding a language label
(use "text") to the opening triple-backticks so the blocks become fenced with
```text to satisfy markdown linting and ensure consistent rendering.
- Around line 266-274: The checklist contains a duplicate/completed item "Add
explicit safety boundary for PhoenixRooivalk" in the tail checklist; remove that
line or change it to a follow-up task (e.g., "Verify existing PhoenixRooivalk
safety boundary and close gaps") so it does not appear as an unchecked action;
check for and reconcile with the existing safety-boundary entries already
documented in the earlier "safety boundary" section and the other mention so the
checklist reflects only outstanding work.
In `@docs/architecture/systems/agentkit-forge.md`:
- Around line 7-24: The code block in the architecture diagram is missing a
fence language which triggers markdownlint MD040; update the opening
triple-backtick for the ASCII diagram to include a language (e.g., change ``` to
```text or ```mermaid if you convert to mermaid) so the fenced block around the
diagram (the block that begins with the three backticks and contains "Agent Task
│ ▼ ... └─→ LLM Synthesis") specifies a language and satisfies the linter.
In `@docs/architecture/systems/codeflow-engine.md`:
- Around line 7-24: The fenced architecture diagram block is missing a language
tag; update the triple-backtick fence that opens the diagram (the ASCII flow
block shown under "Git Push / PR Event") to use the language identifier `text`
(i.e., change ``` to ```text) so markdownlint MD040 is satisfied while
preserving the diagram content.
- Around line 94-100: The f-string assigned to variable prompt embeds literal
JSON braces which must be escaped; in the prompt construction (variable prompt
using change_type and impacted_files) double the outer curly braces around the
JSON-like object (e.g., replace { "run_unit": ... } with {{ "run_unit": ... }})
so Python treats them as literal braces in the f-string rather than expression
delimiters.
In `@docs/architecture/systems/cognitive-mesh.md`:
- Around line 7-29: The fenced ASCII diagram block starting with ``` should
include a language identifier to satisfy markdown lint MD040; edit the block
that begins with the triple backticks before "User Query" and change it to start
with ```text so the diagram (the SLM Control Fabric / Routing Decision / Code
Agent / Research Agent section) is fenced as ```text.
---
Nitpick comments:
In `@docs/architecture/04-observability-telemetry.md`:
- Line 37: The doc uses "Azure Monitor" generically (see node label I2[Azure
Monitor]); update wording to specifically say "Application Insights" where the
deployed resource and OTEL exporter target Application Insights, or add a
clarifying note that Application Insights is the Azure Monitor implementation
used and that the OTEL exporter uses the APPLICATIONINSIGHTS_CONNECTION_STRING
env var; ensure references to the OTEL exporter and the environment variable
(APPLICATIONINSIGHTS_CONNECTION_STRING) and the infrastructure intent (OTEL ->
Application Insights) are explicit.
- Around line 1-94: Add a short "Retention policies" subsection under "Telemetry
Architecture" documenting the Application Insights retention defaults used by
the infra: production retention 90 days and non-production retention 30 days,
note that these are environment-specific and where operators should look to
change them (the Application Insights resource configuration), and recommend
including this retention info in operational runbooks so observability operators
can align expectations and costs; reference the "Application Insights" resource
and the "Telemetry Architecture" section to locate where to add the note.
In `@docs/architecture/reference/slm-implementation-matrix.md`:
- Around line 18-27: The fenced code block in slm-implementation-matrix.md lacks
a language specifier which tools rely on for proper rendering; update the
opening fence for the directory tree in the file (the block starting with ```)
to include a language label such as text or plaintext so it reads e.g. ```text,
ensuring the directory structure snippet renders correctly.
- Around line 254-260: The fenced code block containing the fallback pattern
list should include a language specifier so syntax highlighters render it as
plain text; update the triple-backtick fence around the list (the block starting
with "1. SLM timeout → Deterministic rules") to use a language like "text" or
"plaintext" (e.g., ```text) and keep the content unchanged.
In `@docs/architecture/reference/strategic/07-deployment-model.md`:
- Around line 29-34: Update the CodeFlow "Best SLM Jobs" cell to use the more
specific terminology used across sibling docs: replace the phrase "PR triage,
log analysis" in the CodeFlow table row with targeted terms such as "PR
classification, CI failure triage, release-note extraction" so the matrix aligns
with other references and reduces ambiguity when searching for CodeFlow
capabilities.
In `@docs/architecture/systems/ai-gateway.md`:
- Around line 7-24: The fenced ASCII diagram block in ai-gateway.md is missing a
language specifier; update the opening triple-backtick that precedes the diagram
(the block containing "Client Request" and "SLM Control Layer") to include a
language tag such as text or plaintext (e.g., change ``` to ```text) so the
diagram satisfies markdown linting rules and is treated as plain text.
- Around line 1-146: The docs miss explicit v1 API routing rules; update the
architecture doc to state that incoming /v1/responses and /v1/embeddings
requests must be routed to Azure OpenAI via the LiteLLM provider. Edit the
"Routing Logic" section (update gateway_admission and/or add a short
route_to_provider example) to show mapping of those endpoints to Azure
OpenAI/LiteLLM, add a note in the "Implementation Checklist" to configure
LiteLLM Azure provider for those endpoints, and include a short example/rule
describing any differences in handling responses vs embeddings (e.g., batching,
model selection).
In `@docs/architecture/systems/cognitive-mesh.md`:
- Line 146: Update the tradeoff table entry that currently reads "Very large
token savings" to "Large token savings" to tighten wording; locate the table row
containing the phrase "Very large token savings | Decomposition quality can
bottleneck workflow" in cognitive-mesh.md and replace the left-side cell text
only, leaving the rest of the row unchanged.
In `@docs/architecture/systems/phoenix-rooivalk.md`:
- Around line 7-27: The fenced ASCII diagram currently lacks a language
specifier; update the Markdown fenced code block that surrounds the ASCII
architecture diagram (the block starting with ``` and containing "Sensors",
"Rules + Signal Models + Fusion", "SLM Interpretation Layer", etc.) to include a
language tag (e.g., add `text` after the opening backticks as ```text) so
markdown linting rules are satisfied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e78eaef1-6814-4002-94c9-403bcb72c4a8
📒 Files selected for processing (38)
- docs/architecture/01-system-context.md
- docs/architecture/02-container-architecture.md
- docs/architecture/03-deployment-trust-boundaries.md
- docs/architecture/04-observability-telemetry.md
- docs/architecture/05-slm-llm-decision-flow.md
- docs/architecture/06-shared-contracts.md
- docs/architecture/07-repo-ownership-map.md
- docs/architecture/README.md
- docs/architecture/reference/c4-architecture.md
- docs/architecture/reference/contracts.md
- docs/architecture/reference/cross-system.md
- docs/architecture/reference/dashboards.md
- docs/architecture/reference/deployment-observability.md
- docs/architecture/reference/matrix-agentkit.md
- docs/architecture/reference/matrix-codeflow.md
- docs/architecture/reference/matrix-cognitive-mesh.md
- docs/architecture/reference/matrix-gateway.md
- docs/architecture/reference/matrix-mystira.md
- docs/architecture/reference/matrix-rooivalk.md
- docs/architecture/reference/operations-patterns.md
- docs/architecture/reference/slm-implementation-matrix.md
- docs/architecture/reference/slm-management-plan.md
- docs/architecture/reference/strategic/01-why-slms-matter.md
- docs/architecture/reference/strategic/02-gateway-slm-use-cases.md
- docs/architecture/reference/strategic/03-cognitive-mesh-use-cases.md
- docs/architecture/reference/strategic/04-codeflow-use-cases.md
- docs/architecture/reference/strategic/05-agentkit-use-cases.md
- docs/architecture/reference/strategic/06-rooivalk-use-cases.md
- docs/architecture/reference/strategic/07-deployment-model.md
- docs/architecture/reference/strategic/08-implementation-order.md
- docs/architecture/reference/strategic/README.md
- docs/architecture/systems/agentkit-forge.md
- docs/architecture/systems/ai-gateway.md
- docs/architecture/systems/codeflow-engine.md
- docs/architecture/systems/cognitive-mesh.md
- docs/architecture/systems/mystira.md
- docs/architecture/systems/phoenix-rooivalk.md
- docs/guides/README.md
✅ Files skipped from review due to trivial changes (13)
- docs/architecture/06-shared-contracts.md
- docs/architecture/05-slm-llm-decision-flow.md
- docs/architecture/07-repo-ownership-map.md
- docs/architecture/reference/matrix-mystira.md
- docs/architecture/reference/strategic/03-cognitive-mesh-use-cases.md
- docs/architecture/reference/strategic/01-why-slms-matter.md
- docs/architecture/reference/matrix-agentkit.md
- docs/architecture/reference/contracts.md
- docs/architecture/reference/matrix-codeflow.md
- docs/architecture/reference/strategic/08-implementation-order.md
- docs/architecture/reference/cross-system.md
- docs/architecture/reference/operations-patterns.md
- docs/architecture/reference/matrix-cognitive-mesh.md
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/guides/README.md
- docs/architecture/README.md
```
C3[GitHub Webhooks]
C4[Operator Console]
```
Connect GitHub Webhooks in the container flow.
Line 19 defines C3[GitHub Webhooks], but it is never connected downstream. Please add the expected edge (likely into CodeFlow) so the diagram reflects actual ingress behavior.
Suggested doc fix:

```diff
 C1 --> G1
 C2 --> G1
 C4 --> G1
+C3 --> CF1
```

Also applies to: 58-83
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/02-container-architecture.md` around lines 19 - 20, The
diagram declares the node C3[GitHub Webhooks] but never connects it into the
flow; add an edge from C3 to the appropriate downstream node (e.g., CodeFlow or
whichever node represents ingress/CI processing) so the webhook is part of the
flow. Update the same missing connections in the later block (the repeated
section around lines 58-83) so both occurrences of C3 are connected (for
example: C3 --> CodeFlow or C3 --> <IngressNodeName>), ensuring the diagram
reflects actual webhook ingress behavior.
```
subgraph Ingest
  I1[OpenTelemetry]
  I2[Azure Monitor]
  I3[Blob Export]
end
```
Document Prometheus metrics endpoint.
The actual LiteLLM configuration enables Prometheus callbacks (success_callback and failure_callback include "prometheus"), exposing metrics at /metrics. This is not reflected in the Ingest layer of the telemetry architecture diagram. Based on learnings from infra/modules/aigateway_aca/main.tf:95-113, Prometheus is a primary telemetry destination alongside OTEL.
📊 Suggested addition to the Ingest section
Consider adding Prometheus as a fourth ingest endpoint:
```diff
 subgraph Ingest
   I1[OpenTelemetry]
   I2[Azure Monitor]
   I3[Blob Export]
+  I4[Prometheus]
 end
```

And add the corresponding flow:

```diff
+S5 --> I4
+I4 --> A1
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/04-observability-telemetry.md` around lines 35 - 39, The
Ingest subgraph is missing Prometheus as a telemetry destination; update the
diagram block (the Ingest subgraph containing I1/OpenTelemetry, I2/Azure
Monitor, I3/Blob Export) to add a fourth node for Prometheus (e.g.,
I4[Prometheus]) and add the flow from the application metrics endpoint
(/metrics) into that node; also update the surrounding text to mention that
LiteLLM enables Prometheus callbacks (success_callback/failure_callback include
"prometheus") and reference the infra/modules/aigateway_aca/main.tf:95-113
behavior so readers know Prometheus is a primary telemetry sink alongside OTEL,
Azure Monitor and Blob Export.
```json
{
  "intent": "code_review",
  "complexity": "medium",
  "tool_candidate": true,
  "recommended_target": "codeflow-engine",
  "recommended_model_tier": "small",
  "escalation_required": false,
  "confidence": 0.93
}
```
Align example payload with declared contract shape.
The response example uses intent, recommended_target, and recommended_model_tier, but ClassifyRequestOutput defines label and recommended_tier. This mismatch makes the contract ambiguous for implementers.
Suggested doc fix (example aligned to contract):

```diff
 {
-  "intent": "code_review",
+  "label": "code_review",
   "complexity": "medium",
   "tool_candidate": true,
-  "recommended_target": "codeflow-engine",
-  "recommended_model_tier": "small",
-  "escalation_required": false,
+  "recommended_tier": "small",
+  "cacheable": true,
   "confidence": 0.93
 }
```

Also applies to: 57-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/reference/matrix-gateway.md` around lines 42 - 52, The
example response JSON does not match the declared contract type
ClassifyRequestOutput; update the example payload to use the contract's field
names (label and recommended_tier) and remove or rename
intent/recommended_target/recommended_model_tier accordingly so the example
reflects ClassifyRequestOutput's shape; ensure any referenced values (e.g.,
target engine or tier) are mapped into recommended_tier or documented separately
and apply the same fix to the other example block at lines 57-74.
| Condition | Action |
| -------------------------------- | ---------------------- |
| `policy-screen.allowed == false` | Block or redact |
| `confidence < 0.70` | Escalate to LLM |
| Tool suggested but no mapping | Send to general LLM |
Use one confidence threshold policy across fallback/config/table.
Line 94 says escalate on confidence < 0.70, while Lines 102 and 109-111 imply the boundary is < 0.75. Please unify this to one policy value to avoid conflicting routing behavior.
Based on learnings from docs/architecture/reference/strategic/02-gateway-slm-use-cases.md:84-90, the threshold guide uses 0.90 / 0.75.
Also applies to: 102-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/reference/matrix-gateway.md` around lines 91 - 95, Replace
the inconsistent confidence threshold text so the doc uses the same policy
values from the gateway SLM guide (0.90 / 0.75): update the table row currently
reading "confidence < 0.70" to the correct boundary (e.g., "confidence < 0.75"),
and ensure all other mentions of confidence-based routing (the lines referencing
escalation to LLM or fallback behavior) use the 0.90 / 0.75 thresholds
consistently.
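Reviewer note: the unified 0.90 / 0.75 policy can be pinned down as a small routing function so the doc, config, and table all describe the same behavior. This is a sketch only; the tier names (`direct`, `slm_with_review`, `escalate_llm`) are invented for illustration and are not taken from the gateway code.

```python
def route_by_confidence(confidence: float) -> str:
    """Two-threshold policy: >= 0.90 route directly, >= 0.75 accept with review, else escalate."""
    if confidence >= 0.90:
        return "direct"            # trust the SLM classification outright
    if confidence >= 0.75:
        return "slm_with_review"   # accept the SLM result, but flag for review
    return "escalate_llm"          # below the floor: hand off to the LLM

print(route_by_confidence(0.93))  # direct
print(route_by_confidence(0.80))  # slm_with_review
print(route_by_confidence(0.70))  # escalate_llm
```

With this shape, the table row "confidence < 0.70 → Escalate to LLM" becomes "confidence < 0.75 → Escalate to LLM" and no longer conflicts with the 0.75 boundary used elsewhere.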
```text
┌─────────────────────────────────────────────────────────┐
│ IMPORTANT - SAFETY BOUNDARY │
├─────────────────────────────────────────────────────────┤
│ Hard Decision Layer must NOT depend on free-form SLM │
│ │
│ SLM output is for OBSERVATION and REPORTING only: │
│ • Operator summaries │
│ • SOP suggestions (non-binding) │
│ • Mission log condensation │
│ │
│ SLM must NEVER be used for: │
│ • Autonomous threat response │
│ • Access control decisions │
│ • Resource isolation actions │
│ • Any kinetic or hard control actions │
└─────────────────────────────────────────────────────────┘
```
Add a language hint to the fenced safety block.
Line 27 uses an unlabeled fenced block, which triggers MD040 and can break consistent markdown tooling behavior.
Proposed fix:

````diff
-```
+```text
 ┌─────────────────────────────────────────────────────────┐
 │ IMPORTANT - SAFETY BOUNDARY │
 ...
 └─────────────────────────────────────────────────────────┘
````
🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 27-27: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/reference/matrix-rooivalk.md` around lines 27 - 44, the
fenced safety ASCII-art block in matrix-rooivalk.md is unlabeled, triggering
MD040; update the block fence to include a language hint (e.g., add "text" after
the opening ```) so the safety boundary block becomes a labeled fenced code
block and stops triggering the lint rule while preserving the ASCII-art content.
1. [ ] Establish model registry with tiered selection
2. [ ] Implement cost tracking per project
3. [ ] Set up latency monitoring dashboards
4. [ ] Create edge deployment pipeline
5. [ ] Build security check pipeline
6. [ ] Define fallback hierarchies
7. [ ] Implement observability stack
8. [ ] Document model lifecycle process
9. [ ] **Add explicit safety boundary for PhoenixRooivalk**
Remove or reword the completed safety-boundary action item.
Line 274 says “Add explicit safety boundary for PhoenixRooivalk,” but this is already present earlier (Lines 94-107 and Line 242). Keeping it unchecked creates planning drift.
Proposed fix:

```diff
-9. [ ] **Add explicit safety boundary for PhoenixRooivalk**
+9. [x] **Add explicit safety boundary for PhoenixRooivalk**
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/reference/slm-management-plan.md` around lines 266 - 274,
The checklist contains a duplicate/completed item "Add explicit safety boundary
for PhoenixRooivalk" in the tail checklist; remove that line or change it to a
follow-up task (e.g., "Verify existing PhoenixRooivalk safety boundary and close
gaps") so it does not appear as an unchecked action; check for and reconcile
with the existing safety-boundary entries already documented in the earlier
"safety boundary" section and the other mention so the checklist reflects only
outstanding work.
```text
Agent Task
│
▼
┌─────────────────────────────────────┐
│ SLM Execution Governor │
│ (tool selection, memory, budget) │
└─────────────────────────────────────┘
│
▼
Tool Selection
│
├─→ GitHub API
├─→ Azure CLI
├─→ Terraform
├─→ Documentation Search
└─→ LLM Synthesis
```
Add a fence language to satisfy markdownlint MD040.
The fenced block at Line 7 has no language. Use text (or mermaid if converted) to keep lint clean.
Suggested doc fix:

````diff
-```
+```text
 Agent Task
 │
 ▼
 ...
 └─→ LLM Synthesis
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 7-7: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/systems/agentkit-forge.md` around lines 7 - 24, The code
block in the architecture diagram is missing a fence language which triggers
markdownlint MD040; update the opening triple-backtick for the ASCII diagram to
include a language (e.g., change ``` to ```text or ```mermaid if you convert to
mermaid) so the fenced block around the diagram (the block that begins with the
three backticks and contains "Agent Task │ ▼ ... └─→ LLM Synthesis") specifies a
language and satisfies the linter.
```text
Git Push / PR Event
│
▼
┌─────────────────────────────────────┐
│ SLM Triage Layer │
│ (classification, risk, pipeline) │
└─────────────────────────────────────┘
│
▼
CI/CD Decision
│
├─→ Auto approve
├─→ Run tests (full/minimal/skip)
├─→ Security review
└─→ Escalate to LLM
```
Add a language for the fenced architecture block.
The code fence starting at Line 7 is missing a language tag; set it to text to satisfy markdownlint MD040.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 7-7: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/systems/codeflow-engine.md` around lines 7 - 24, The fenced
architecture diagram block is missing a language tag; update the triple-backtick
fence that opens the diagram (the ASCII flow block shown under "Git Push / PR
Event") to use the language identifier `text` (i.e., change ``` to ```text) so
markdownlint MD040 is satisfied while preserving the diagram content.
```python
prompt = f"""Select tests for this change:

Type: {change_type}
Files: {', '.join(impacted_files)}

Output: { "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }"""
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify potential unescaped JSON braces inside Python f-strings in architecture docs.
rg -nP --type=md 'f"""[\s\S]*Output:\s*\{[^{}]*\}"""' docs/architecture
```

Repository: phoenixvc/ai-gateway
Length of output: 46

🏁 Script executed:

```bash
cat -n docs/architecture/systems/codeflow-engine.md | sed -n '90,105p'
```

Repository: phoenixvc/ai-gateway
Length of output: 560
Fix invalid Python f-string literal braces in the prompt example.
Line 99 embeds JSON braces directly in an f-string without escaping, which is invalid Python syntax. The curly braces in { "run_unit": bool, ... } must be doubled to {{ "run_unit": bool, ... }} to be treated as literal characters in the f-string. This code snippet will fail with a SyntaxError when copied into real code.
Suggested fix
Suggested fix:

```diff
-Output: { "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }"""
+Output: {{ "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }}"""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/systems/codeflow-engine.md` around lines 94 - 100, The
f-string assigned to variable prompt embeds literal JSON braces which must be
escaped; in the prompt construction (variable prompt using change_type and
impacted_files) double the outer curly braces around the JSON-like object (e.g.,
replace { "run_unit": ... } with {{ "run_unit": ... }}) so Python treats them as
literal braces in the f-string rather than expression delimiters.
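Reviewer note: the doubled-brace fix can be verified in isolation. The snippet below uses placeholder values for `change_type` and `impacted_files` (they are free variables in the doc's example) and shows that `{{` / `}}` render as literal braces while the real substitutions still interpolate.

```python
change_type = "feature"                 # placeholder values for illustration only
impacted_files = ["api.py", "models.py"]

# With doubled braces this is valid Python; single braces would raise a SyntaxError
# or be treated as (invalid) f-string expressions.
prompt = f"""Select tests for this change:

Type: {change_type}
Files: {', '.join(impacted_files)}

Output: {{ "run_unit": bool, "run_integration": bool, "run_e2e": bool, "skip_reason": str|null }}"""

print(prompt)
```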
```text
User Query
│
▼
┌─────────────────────────────────────┐
│ SLM Control Fabric │
│ (routing, decomposition, compression)│
└─────────────────────────────────────┘
│
▼
Routing Decision
│
├─→ Code Agent
├─→ Infra Agent
├─→ Security Agent
└─→ Research Agent
│
▼
Specialist Work
│
▼
LLM Synthesis (only when needed)
```
Add a language identifier to the architecture fenced block.
Line 7 starts a fenced block without a language tag, which triggers markdown linting (MD040). Use text for this ASCII diagram.
Suggested diff:

````diff
-```
+```text
 User Query
 │
 ▼
 ┌─────────────────────────────────────┐
 │ SLM Control Fabric │
 │ (routing, decomposition, compression)│
 └─────────────────────────────────────┘
 │
 ▼
 Routing Decision
 │
 ├─→ Code Agent
 ├─→ Infra Agent
 ├─→ Security Agent
 └─→ Research Agent
 │
 ▼
 Specialist Work
 │
 ▼
 LLM Synthesis (only when needed)
````
🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 7-7: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@docs/architecture/systems/cognitive-mesh.md` around lines 7 - 29, the fenced
ASCII diagram block starting with ``` should include a language identifier to
satisfy markdown lint MD040; edit the block that begins with the triple
backticks before "User Query" and change it to start with ```text so the diagram
(the SLM Control Fabric / Routing Decision / Code Agent / Research Agent
section) is fenced as ```text.
Refactor deployment process by extracting environment-specific deployment logic into a reusable workflow. This reduces code duplication across dev/staging/prod environments and centralizes deployment configuration parameters.
Actionable comments posted: 4
♻️ Duplicate comments (1)
docs/planning/request_to_token_attribution.md (1)
125-155:⚠️ Potential issue | 🟡 MinorClarify “required” vs “alternative” integration path.
Line 125 says request-body metadata is required, but the section immediately presents headers as a second method. This still reads as conflicting requirement language.
✏️ Suggested wording tweak
```diff
-**Required:** Pass correlation metadata in request body when calling gateway. There are two methods:
+**Required (current implementation):** Pass correlation metadata in request body when calling gateway.
+**Alternative (future enhancement):** HTTP header-based correlation is possible only with additional middleware/wrapper support.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/planning/request_to_token_attribution.md` around lines 125 - 155, Change the ambiguous "Required:" label to clearly indicate that request-body metadata is the recommended/default integration (e.g., replace "Required:" with "Recommended/Preferred") and add one sentence after "Method B: Via HTTP Headers" stating that headers are an alternative option that requires additional LiteLLM configuration or middleware; reference the existing headings "Method A: Via Request Metadata (Recommended)" and "Method B: Via HTTP Headers" so the reader knows which sections to update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy-environment.yaml:
- Around line 53-66: The reusable workflow currently only sets a subset of env
vars (TF_VAR_azure_openai_*, TF_VAR_gateway_key, TF_VAR_codex_*, etc.), causing
caller-level
secrets/TF_BACKEND_*/AZURE_CLIENT_ID/AZURE_TENANT_ID/AZURE_SUBSCRIPTION_ID,
EXPECTED_AOAI_ENDPOINT_HOST and TF_VAR_state_service_container_image to be empty
at apply time and breaking guards; update the env block to explicitly pass all
required variables from inputs/secrets (including AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*,
TF_VAR_state_service_container_image, EXPECTED_AOAI_ENDPOINT_HOST and any other
Terraform TF_VAR_* the plan relies on) so plan and apply see the same inputs,
and replace the invalid GitHub Actions function syntax if(inputs.env_name ==
'prod', '3', '1') with the supported ternary-like expression inputs.env_name ==
'prod' && '3' || '1' to restore the Line 211 value evaluation and ensure the
prod AOAI guard and Line 215 smoke gate run as intended.
- Line 211: The models_wait_attempts assignment uses the unsupported if()
expression; update the expression for models_wait_attempts to use GitHub
Actions' case() function to select '3' when inputs.env_name == 'prod' otherwise
'1' (reference the models_wait_attempts key in the deploy-environment workflow);
ensure the case() expression compares inputs.env_name to 'prod' and returns the
correct string values, or alternatively use the safe conditional pattern
(condition && value1 || value2) only if value1 cannot be falsy.
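Reviewer note: the caveat about `condition && value1 || value2` mirrors Python's old `a and b or c` idiom, and the failure mode is easy to demonstrate outside of GitHub Actions. This is plain Python modeling the same logic, not workflow syntax:

```python
def ternary_like(cond, value1, value2):
    """Mimic the `cond && value1 || value2` pattern used in workflow expressions."""
    return cond and value1 or value2

# Safe here because '3' is a non-empty (truthy) string:
print(ternary_like(True, "3", "1"))   # 3
print(ternary_like(False, "3", "1"))  # 1

# The pitfall: a falsy value1 falls through to value2 even when cond is true.
print(ternary_like(True, "", "1"))    # 1  (not the intended "")
```

This is why the pattern is only safe when `value1` can never be falsy, as the comment above notes for the `models_wait_attempts` case.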
In @.github/workflows/deploy.yaml:
- Around line 154-170: The job-level environment key must be removed from the
reusable-workflow caller jobs (e.g., deploy-dev, deploy-staging, deploy-prod)
because jobs that use `uses:` cannot include `environment:`; update each caller
to only pass the env via the `env_name` input (already present) and then add
`environment: ${{ inputs.env_name }}` inside the called workflow
`deploy-environment.yaml` (use the `inputs.env_name` value in that workflow's
job definition) so the environment selection is handled by the reusable workflow
rather than the callers.
In `@docs/planning/request_to_token_attribution.md`:
- Line 221: The doc and Terraform behavior conflict because a new Terraform
output application_insights_connection_string exposes a secret while the policy
mandates routing secrets through Key Vault and never outputting them; either
remove the Terraform output or mark it as sensitive and change the deployment to
store the connection string in Key Vault (and update any consumers to read from
Key Vault), then update this doc (and the related note at lines ~229) to state
the chosen approach (no outputs for secrets or sensitive output with Key
Vault-backed handling) and reference the application_insights_connection_string
symbol and Key Vault usage so readers know how to retrieve the secret securely.
---
Duplicate comments:
In `@docs/planning/request_to_token_attribution.md`:
- Around line 125-155: Change the ambiguous "Required:" label to clearly
indicate that request-body metadata is the recommended/default integration
(e.g., replace "Required:" with "Recommended/Preferred") and add one sentence
after "Method B: Via HTTP Headers" stating that headers are an alternative
option that requires additional LiteLLM configuration or middleware; reference
the existing headings "Method A: Via Request Metadata (Recommended)" and "Method
B: Via HTTP Headers" so the reader knows which sections to update.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1703970d-036d-4366-b9a3-a25421b2c360
📒 Files selected for processing (3)
- .github/workflows/deploy-environment.yaml
- .github/workflows/deploy.yaml
- docs/planning/request_to_token_attribution.md
```yaml
env:
  TF_VAR_env: ${{ inputs.env_name }}
  TF_VAR_projname: "aigateway"
  TF_VAR_location: "southafricanorth"
  TF_VAR_location_short: "san"
  TF_VAR_azure_openai_endpoint: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
  TF_VAR_azure_openai_api_key: ${{ secrets.AZURE_OPENAI_API_KEY }}
  TF_VAR_azure_openai_embedding_endpoint: ${{ secrets.AZURE_OPENAI_EMBEDDING_ENDPOINT }}
  TF_VAR_azure_openai_embedding_api_key: ${{ secrets.AZURE_OPENAI_EMBEDDING_API_KEY }}
  TF_VAR_gateway_key: ${{ secrets.AIGATEWAY_KEY }}
  TF_VAR_codex_model: ${{ inputs.codex_model }}
  TF_VAR_codex_api_version: ${{ inputs.codex_api_version }}
  TF_VAR_embedding_deployment: "text-embedding-3-large"
  TF_VAR_embeddings_api_version: "2024-02-01"
```
🧩 Analysis chain
🌐 Web query:
Do reusable GitHub Actions workflows inherit the caller workflow's top-level env: values, or must those values be passed explicitly and rebound inside the called workflow?
💡 Result:
No—reusable (called) workflows do not inherit the caller workflow’s top-level env: values. GitHub explicitly states that “Any environment variables set in an env context defined at the workflow level in the caller workflow are not propagated to the called workflow.” [1]
So if the called workflow needs those values, you must pass them explicitly (typically via on: workflow_call: inputs: / jobs.<job_id>.with:) and then rebind/use them inside the called workflow. [1]
🏁 Script executed:
```shell
sed -n '211p' .github/workflows/deploy-environment.yaml
```
Repository: phoenixvc/ai-gateway
Length of output: 142
Rebuild the full deploy env inside the reusable workflow and fix Line 211 syntax.
Reusable workflows do not inherit the caller's top-level env: values—they must be passed explicitly. This workflow only recreates the AOAI/gateway subset, leaving AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*, and other Terraform variables empty for Lines 79-137. The result is that apply executes with a different input set than the reviewed plan, TF_VAR_state_service_container_image stays empty so the Line 215 smoke gate never runs, and the prod AOAI guard at Lines 106-113 is inert because EXPECTED_AOAI_ENDPOINT_HOST is never exported. Additionally, Line 211 uses invalid syntax if(inputs.env_name == 'prod', '3', '1') — GitHub Actions has no if() function; use a ternary pattern instead: inputs.env_name == 'prod' && '3' || '1'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/deploy-environment.yaml around lines 53 - 66, The reusable
workflow currently only sets a subset of env vars (TF_VAR_azure_openai_*,
TF_VAR_gateway_key, TF_VAR_codex_*, etc.), causing caller-level
secrets/TF_BACKEND_*/AZURE_CLIENT_ID/AZURE_TENANT_ID/AZURE_SUBSCRIPTION_ID,
EXPECTED_AOAI_ENDPOINT_HOST and TF_VAR_state_service_container_image to be empty
at apply time and breaking guards; update the env block to explicitly pass all
required variables from inputs/secrets (including AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, TF_BACKEND_*,
TF_VAR_state_service_container_image, EXPECTED_AOAI_ENDPOINT_HOST and any other
Terraform TF_VAR_* the plan relies on) so plan and apply see the same inputs,
and replace the invalid GitHub Actions function syntax if(inputs.env_name ==
'prod', '3', '1') with the supported ternary-like expression inputs.env_name ==
'prod' && '3' || '1' to restore the Line 211 value evaluation and ensure the
prod AOAI guard and Line 215 smoke gate run as intended.
```yaml
aoai_api_key: ${{ env.TF_VAR_azure_openai_api_key }}
max_attempts: "3"
retry_sleep: ${{ inputs.smoke_retry_sleep }}
models_wait_attempts: ${{ if(inputs.env_name == 'prod', '3', '1') }}
```
🧩 Analysis chain
🌐 Web query:
Does GitHub Actions expression syntax support `if()` inside `${{ }}` expressions, or should ternary behavior be expressed with `&&`/`||`?
💡 Result:
GitHub Actions expressions do not provide an `if()` function you can call inside `${{ ... }}`. `${{ ... }}` supports literals, operators (including `&&` / `||`), and a fixed set of built-in functions; `if()` is not one of them. [1]
So for "ternary-like" value selection you have two options:
- Recommended (now supported): use the `case()` expression function for real if/else behavior, e.g. `${{ case(condition, 'a', 'b') }}`. [2][3]
- Older workaround: use `condition && valueIfTrue || valueIfFalse` (note this can misbehave if `valueIfTrue` is falsy like `''`, `0`, or `false`, because `||` will then fall through). `&&`/`||` are the supported logical operators. [1]
Sources:
[1] GitHub Docs – Operators in expressions (&&, ||)
[2] GitHub Docs – case() function reference and examples
[3] GitHub Changelog (Jan 29, 2026) – announcement of new case function
Replace the unsupported if() expression with the case() function.
GitHub Actions expressions do not support if() as a function inside ${{ }}. Use the case() expression function for conditional value selection:
Suggested fix
```diff
- models_wait_attempts: ${{ if(inputs.env_name == 'prod', '3', '1') }}
+ models_wait_attempts: ${{ case(inputs.env_name == 'prod', '3', '1') }}
```

Alternatively, the `condition && value1 || value2` pattern works for non-falsy values but has pitfalls if `value1` is falsy.
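To make the falsy-value pitfall of the `&&`/`||` workaround concrete, consider a hypothetical input whose true-branch value can be an empty string (the input name here is illustrative, not from the workflow under review):

```yaml
# Pitfall sketch: if inputs.prod_suffix is '' (falsy), the || falls
# through to 'default' even when env_name == 'prod', silently picking
# the wrong branch. The case() form does not have this failure mode.
name_suffix: ${{ inputs.env_name == 'prod' && inputs.prod_suffix || 'default' }}
```

This is why the review prefers `case()` and only permits the `&&`/`||` pattern when the true-branch value can never be falsy.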
🧰 Tools
🪛 actionlint (1.7.11)
[error] 211-211: undefined function "if". available functions are "always", "cancelled", "case", "contains", "endswith", "failure", "format", "fromjson", "hashfiles", "join", "startswith", "success", "tojson"
(expression)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/deploy-environment.yaml at line 211, The
models_wait_attempts assignment uses the unsupported if() expression; update the
expression for models_wait_attempts to use GitHub Actions' case() function to
select '3' when inputs.env_name == 'prod' otherwise '1' (reference the
models_wait_attempts key in the deploy-environment workflow); ensure the case()
expression compares inputs.env_name to 'prod' and returns the correct string
values, or alternatively use the safe conditional pattern (condition && value1
|| value2) only if value1 cannot be falsy.
```yaml
uses: ./.github/workflows/deploy-environment.yaml
with:
  env_name: dev
  tf_state_key: dev.terraform.tfstate
  codex_model: gpt-5.3-codex
  codex_api_version: 2025-04-01-preview
  terraform_working_directory: infra/env/dev
  smoke_retry_sleep: "10"
  smoke_models_wait_sleep: "15"
  include_aoai_host_check: false
secrets:
  AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
  AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
  AZURE_OPENAI_EMBEDDING_ENDPOINT: ${{ secrets.AZURE_OPENAI_EMBEDDING_ENDPOINT }}
  AZURE_OPENAI_EMBEDDING_API_KEY: ${{ secrets.AZURE_OPENAI_EMBEDDING_API_KEY }}
  AIGATEWAY_KEY: ${{ secrets.AIGATEWAY_KEY }}
```
🧩 Analysis chain
🌐 Web query:
Does GitHub Actions allow `environment:` on a job that calls a reusable workflow via `uses:`? If not, what is the documented way to bind the called job to dev, staging, or prod?
💡 Result:
No. A job that calls a reusable workflow via `jobs.<job_id>.uses:` can only use a limited set of keywords, and `environment` is not one of them (so `environment:` on the calling job is not supported). [1]
Documented way to bind to dev / staging / prod:
- Pass the target environment name as an input from the caller (often via a matrix of dev/stage/prod). [2]
- Set `environment:` inside the reusable workflow's job, using that input (the environment name can be an expression and can use the `inputs` context). [3]
- This is also why environment secrets can't be passed from the caller via `workflow_call` (it doesn't support `environment`), so the environment needs to be selected in the called workflow's job. [2]
Example:
Caller:

```yaml
jobs:
  deploy:
    strategy:
      matrix:
        target: [dev, staging, prod]
    uses: org/repo/.github/workflows/deploy.yml@main
    with:
      target: ${{ matrix.target }}
```

[2]

Reusable workflow (deploy.yml):

```yaml
on:
  workflow_call:
    inputs:
      target:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment:
      name: ${{ inputs.target }}
    steps:
      - run: echo "Deploying to ${{ inputs.target }}"
```

[3]
Sources: GitHub Docs on reusable-workflow keyword limits [1], reusable workflows + environment-secret note + matrix example [2], and workflow syntax for jobs.<job_id>.environment (name can use inputs) [3].
🏁 Script executed:
```shell
cd .github/workflows && head -n 220 deploy.yaml | tail -n 70
```
Repository: phoenixvc/ai-gateway
Length of output: 2993
Remove environment: from all three reusable-workflow caller jobs.
Jobs that call a reusable workflow via uses: cannot use the environment: key. Currently, deploy-dev, deploy-staging, and deploy-prod all specify environment: at the job level alongside uses:, which violates GitHub Actions job schema. Move environment selection into the called workflow (deploy-environment.yaml) by setting environment: there with the input value (e.g., environment: ${{ inputs.env_name }}). The caller jobs already pass the environment name correctly via env_name input and only need the invalid environment: key removed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/deploy.yaml around lines 154 - 170, The job-level
environment key must be removed from the reusable-workflow caller jobs (e.g.,
deploy-dev, deploy-staging, deploy-prod) because jobs that use `uses:` cannot
include `environment:`; update each caller to only pass the env via the
`env_name` input (already present) and then add `environment: ${{
inputs.env_name }}` inside the called workflow `deploy-environment.yaml` (use
the `inputs.env_name` value in that workflow's job definition) so the
environment selection is handled by the reusable workflow rather than the
callers.
```
- cognitive-mesh: Pass correlation metadata in request body
- pvc-costops-analytics: Must create KQL queries for new event shape
- infra: May need custom LiteLLM container image or OTEL collector
- infra: Application Insights resource + APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; trace export requires custom LiteLLM image (with azure-monitor-opentelemetry) or explicit OTEL_EXPORTER_OTLP_ENDPOINT configuration (currently empty by default)
```
Documented secret-handling policy conflicts with Terraform output behavior.
This doc says secrets should go through Key Vault and (elsewhere in this file) says to never output secrets from Terraform, but the PR summary indicates a new application_insights_connection_string Terraform output was added. Please reconcile this to avoid normalizing secret exposure via outputs.
🔒 Suggested doc correction (if output remains temporarily)
```diff
-- infra: Application Insights resource + APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; trace export requires custom LiteLLM image (with azure-monitor-opentelemetry) or explicit OTEL_EXPORTER_OTLP_ENDPOINT configuration (currently empty by default)
+- infra: Application Insights resource + Key Vault-backed APPLICATIONINSIGHTS_CONNECTION_STRING wiring added; avoid exposing connection strings as Terraform outputs, and use OTEL_EXPORTER_OTLP_ENDPOINT/custom image as required for export path
```

Also applies to: 229-229
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/planning/request_to_token_attribution.md` at line 221, The doc and
Terraform behavior conflict because a new Terraform output
application_insights_connection_string exposes a secret while the policy
mandates routing secrets through Key Vault and never outputting them; either
remove the Terraform output or mark it as sensitive and change the deployment to
store the connection string in Key Vault (and update any consumers to read from
Key Vault), then update this doc (and the related note at lines ~229) to state
the chosen approach (no outputs for secrets or sensitive output with Key
Vault-backed handling) and reference the application_insights_connection_string
symbol and Key Vault usage so readers know how to retrieve the secret securely.
Summary
Changes
Infrastructure
- `infra/modules/aigateway_aca/main.tf`: Add Application Insights resource, Key Vault secret, and container app env var
- `infra/modules/aigateway_aca/outputs.tf`: Add `application_insights_connection_string` output

Documentation
Testing
- Run `terraform apply` to create App Insights resource
Summary by CodeRabbit
New Features
Updates
Documentation
Chores